Federated Search

Federated search retrieves information from a variety of sources via a search application built on top of one or more search engines. A user makes a single query request, which is distributed to the search engines, databases or other query engines participating in the federation. The federated search then aggregates the results received from those engines for presentation to the user. Federated search can be used to integrate disparate information resources within a single large organization ("enterprise") or across the entire web. Unlike distributed search, federated search requires centralized coordination of the searchable resources. This involves both coordinating the queries transmitted to the individual search engines and fusing the search results returned by each of them.


Purpose

Federated search came about to meet the need of searching multiple disparate content sources with one query. It allows a user to search multiple databases at once in real time, arranges the results from the various databases into a useful form, and then presents them to the user. As such, it is an information aggregation or integration approach: it provides single-point access to many information resources and typically returns the data in a standard or partially homogenized form. Other approaches include constructing an enterprise data warehouse, a data lake, or a data hub. Federated search queries many times in many ways (each source is queried separately), whereas the other approaches import and transform data many times, typically in overnight batch processes. Federated search provides a real-time view of all sources (to the extent that they are all online and available).

In industrial search engines such as LinkedIn, federated search is used to personalize vertical preference for ambiguous queries. For instance, when a user issues a query like "machine learning" on LinkedIn, they could mean to search for people with machine learning skills, jobs requiring machine learning skills, or content about the topic. In such cases, federated search can exploit user intent (e.g., hiring, job seeking or content consuming) to personalize the vertical order for each individual user.


Process

As described by Peter Jacso (2004), federated searching consists of (1) transforming a query and broadcasting it to a group of disparate databases or other web resources, with the appropriate syntax, (2) merging the results collected from the databases, (3) presenting them in a succinct and unified format with minimal duplication, and (4) providing a means, performed either automatically or by the portal user, to sort the merged result set.

Federated search portals, either commercial or open access, generally search public access bibliographic databases, public access Web-based library catalogues (OPACs), Web-based search engines like Google, and/or open-access, government-operated or corporate data collections. These individual information sources send back to the portal's interface a list of results from the search query, and the user can review this hit list. Some portals merely screen scrape the actual database results and do not directly allow a user to enter the information source's application. More sophisticated ones de-duplicate the results list by merging and removing duplicates. Many portals offer additional features, but the basic idea is the same: to improve the accuracy and relevance of individual searches and to reduce the time required to search for resources.

This process gives federated search some key advantages over crawler-based search engines. Federated search need not place any requirements or burdens on the owners of the individual information sources, other than handling increased traffic. Federated searches are also inherently as current as the individual information sources, because they are searched in real time.
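The four steps described by Jacso can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two in-memory sources, their differing query syntax (one upper-case, one lower-case), the record fields, and the choice to de-duplicate on a normalized title. A real portal would call remote databases and would likely need a more careful duplicate test.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "sources"; a real portal would query remote systems.
CATALOG = [{"title": "Federated Search Primer", "year": 2009}]
JOURNALS = [
    {"title": "federated search primer ", "year": 2009},
    {"title": "Federated Search 101", "year": 2008},
]


def search_catalog(query):
    # Assumed syntax: this source matches on upper-cased text.
    q = query.upper()
    return [r for r in CATALOG if q in r["title"].upper()]


def search_journals(query):
    # Assumed syntax: this source matches on lower-cased text.
    q = query.lower()
    return [r for r in JOURNALS if q in r["title"].lower()]


SOURCES = {"catalog": search_catalog, "journals": search_journals}


def federated_search(query, sort_key="year"):
    # (1) Broadcast the query to every source in parallel; each source
    #     function applies its own syntax transformation.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, query) for name, fn in SOURCES.items()}
        batches = {name: fut.result() for name, fut in futures.items()}

    # (2) Merge the result lists, tagging each record with its source.
    merged = [dict(r, source=name) for name, rows in batches.items() for r in rows]

    # (3) De-duplicate on a normalized title, keeping the first occurrence.
    seen, unique = set(), []
    for record in merged:
        key = " ".join(record["title"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(record)

    # (4) Sort the merged, de-duplicated result set.
    return sorted(unique, key=lambda r: r[sort_key])
```

Calling `federated_search("federated")` queries both sources, drops the journal copy of the primer because its normalized title matches the catalog record, and returns the remaining records ordered by year.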


Implementation

One application of federated searching is the metasearch engine. However, the metasearch approach does not overcome the shortcomings of the component search engines, such as incomplete indexes. Documents that are not indexed by search engines make up what is known as the deep Web, or invisible Web. Google Scholar is one example of many projects trying to address this, by indexing electronic documents that search engines ignore. Moreover, the metasearch approach, like the underlying search engine technology, only works with information sources stored in electronic form.

One of the main challenges of metasearch is ensuring that the search query is compatible with the component search engines that are being federated and combined. When the search vocabulary or data model of the search system differs from the data model of one or more of the foreign target systems, the query must be translated for each of those target systems. This can be done using simple data-element translation or may require semantic translation. For example, if one search engine allows quoting of exact strings or n-grams and another does not, the query must be translated to be compatible with each search engine. To translate a quoted exact-string query, it can be broken down into a set of overlapping n-grams that are most likely to give the desired search results in each search engine.

Another challenge faced in the implementation of federated search engines is scalability. It is difficult to maintain the performance (the response speed) of a federated search engine as it combines more and more information sources. One implementation of federated search that has begun to address this issue is WorldWideScience, hosted by the U.S. Department of Energy's Office of Scientific and Technical Information. WorldWideScience is composed of more than 40 information sources, several of which are federated search portals themselves. One such portal is Science.gov, which itself federates more than 30 information sources representing most of the R&D output of the U.S. Federal government. Science.gov returns its highest-ranked results to WorldWideScience, which then merges and ranks these results with the results returned by the other information sources that comprise WorldWideScience. This approach of cascaded federated search enables a large number of information sources to be searched via a single query.

Another application, Sesam, running in both Norway and Sweden, has been built on top of an open-source platform specialised for federated search solutions. Sesat, an acronym for Sesam Search Application Toolkit, is a platform that provides much of the framework and functionality required for handling parallel and pipelined searches and displaying them elegantly in a user interface, allowing engineers to focus on tuning the index/database configuration.

To personalize vertical orders in federated search, the LinkedIn search engine exploits the searcher's profile and recent activities to infer their intent, such as hiring, job seeking or content consuming. It then uses that intent, along with many other signals, to rank the verticals in an order that is personally relevant to the individual searcher.

SWIRL Search is an open source federated search engine, released under the Apache 2.0 license. It includes pre-built connectors to popular open source search engines, and re-ranks results using cosine vector similarity.
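The translation of a quoted exact-string query into overlapping n-grams can be sketched as follows. The function names, the default n-gram size of 2, and the boolean capability flag are hypothetical; how the resulting n-grams are then submitted to an engine without phrase support (as separate queries, or combined in some way) depends on that engine and is left open here.

```python
def overlapping_ngrams(phrase, n=2):
    """Split a phrase into overlapping word n-grams."""
    words = phrase.split()
    if len(words) <= n:
        # The phrase is too short to split; return it whole (or nothing).
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def translate_for_engine(phrase, supports_exact_phrase, n=2):
    """Return the query terms to send to one engine for an exact phrase."""
    if supports_exact_phrase:
        # The engine understands quoting, so pass the phrase through quoted.
        return [f'"{phrase}"']
    # Otherwise approximate the phrase with overlapping n-grams.
    return overlapping_ngrams(phrase, n)
```

For the phrase `deep web search engine` and an engine without phrase support, this yields the bigrams `deep web`, `web search` and `search engine`, while an engine that supports quoting receives the original phrase in quotes.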


Challenges

When federated search is performed against secure data sources, the users' credentials must be passed on to each underlying search engine so that appropriate security is maintained. If the user has different login credentials for different systems, there must be a means to map their login ID to each search engine's security domain.

Another challenge is mapping results list navigators into a common form. Suppose three real-estate sites are searched, and each provides a list of hyperlinked city names that the user can click to see matches only in that city. Ideally these facets would be combined into one set, but that presents additional technical challenges (see "20+ Differences Between Internet vs. Enterprise Search - part 1"). The system also needs to understand "next page" links if it is going to allow the user to page through the combined results. Some of this challenge of mapping to a common form can be solved if the federated resources support linked open data via RDF. Ontologies (rules) can be added to map results to common forms using that technology.

Another challenge is sorting and scoring results. Each web resource has its own notion of relevance score and may support only some sort orders. Relevance varies greatly among the "federates" in the search, so knowing how to interleave results to show the most relevant ones first is difficult or impossible.

Another challenge is robust querying. Federated search may have to restrict itself to the minimal set of query capabilities that are common to all federates. For example, if Google supports negation and quoted phrases but Science.gov does not, it will be impossible for the federated search to support negated, quoted phrases.

Another challenge is availability and timeouts. As the number of federates (federated sources) grows, the likelihood that one or more federates is slow or offline becomes high. The federated search must decide when to consider a federate offline and when to wait for a slow response. If it waits for every federate, response times are dictated by the slowest one.

Another challenge is development and testing within an enterprise (as opposed to on the public internet). Development groups should typically not hit live production systems during regular work, much less during intensive load testing. In addition, some resources are secure and should not be arbitrarily queried and exposed in development, due to privacy and security concerns. Therefore, the development, testing and performance test environments must include installation and configuration of many sub-systems to allow safe, secure testing.

Another challenge within an enterprise is HA/DR (high availability and disaster recovery). For the overall federated system to be HA/DR, every sub-system must be HA/DR. Similarly, performance modeling and capacity planning for the federated system require modeling, planning and sometimes expansion of all federates.

For the reasons above, a data hub or data lake, or a hybrid approach, may be preferable within an enterprise. Data hubs and lakes simplify development and access, but may incur some time lag before data is available (without special synchronizing logic). On the web, federation is more typical.
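One way to handle slow or offline federates is to give the whole fan-out a deadline and return whatever has arrived by then. The sketch below is one possible policy, not a prescribed one: the three stand-in federates, the function name `search_with_deadline`, and the decision to report late and failed federates together as "skipped" are all assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def fast_source(query):
    return [f"fast:{query}"]


def slow_source(query):
    time.sleep(1.0)  # simulates a federate that responds too slowly
    return [f"slow:{query}"]


def offline_source(query):
    raise ConnectionError("federate offline")  # simulates an outage


def search_with_deadline(sources, query, timeout_s):
    """Query all federates in parallel and keep only timely answers."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = {pool.submit(fn, query): name for name, fn in sources.items()}

    # Block until every federate has answered or the deadline passes.
    done, not_done = wait(futures, timeout=timeout_s)

    results, skipped = [], []
    for fut in done:
        if fut.exception() is None:
            results.extend(fut.result())
        else:
            skipped.append(futures[fut])  # the federate raised an error
    skipped.extend(futures[fut] for fut in not_done)  # missed the deadline

    # Do not wait for stragglers; their threads finish in the background.
    pool.shutdown(wait=False)
    return results, sorted(skipped)
```

With a deadline of 0.2 seconds, only the fast federate contributes results, and the slow and offline federates are reported as skipped so that the interface can tell the user the result set is partial.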


See also

* Search aggregator
* Z39.50


Further reading


* Linoski, Alexis; Walczyk, Tine. Federated Search 101. Library Journal, Summer 2008 Net Connect, Vol. 133. The content has since been moved, and reading the whole article requires a remote access account through a local library.
* Cox, Christopher N. Federated Search: Solution or Setback for Online Library Services. Binghamton, NY: Haworth Information Press, 2007.
* Lederman, S. Federated Search Primer. AltSearchEngines, January 2009. This material has been reposted on the blog of a commercial search engine company.